class: center, middle, inverse, title-slide # Insurance pricing analytics with R ## Workshop on Pricing Analytics
### Katrien Antonio, Roel Henckaerts and Roel Verbelen ###
Workshop Pricing Analytics
| March, 2020 --- class: inverse, center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: introduction # Introduction ### Workshop <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> https://github.com/katrienantonio/PE-pricing-analytics The workshop repo on GitHub, where we upload presentation, code, data sets, etc. -- ### Us <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> [https://katrienantonio.github.io/](https://katrienantonio.github.io/) <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> 
[katrien.antonio@kuleuven.be](mailto:katrien.antonio@kuleuven.be) & [roel.henckaerts@kuleuven.be](mailto:roel.henckaerts@kuleuven.be) & [roel.verbelen@gmail.com](mailto:roel.verbelen@gmail.com) <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 640 512"><path d="M622.34 153.2L343.4 67.5c-15.2-4.67-31.6-4.67-46.79 0L17.66 153.2c-23.54 7.23-23.54 38.36 0 45.59l48.63 14.94c-10.67 13.19-17.23 29.28-17.88 46.9C38.78 266.15 32 276.11 32 288c0 10.78 5.68 19.85 13.86 25.65L20.33 428.53C18.11 438.52 25.71 448 35.94 448h56.11c10.24 0 17.84-9.48 15.62-19.47L82.14 313.65C90.32 307.85 96 298.78 96 288c0-11.57-6.47-21.25-15.66-26.87.76-15.02 8.44-28.3 20.69-36.72L296.6 284.5c9.06 2.78 26.44 6.25 46.79 0l278.95-85.7c23.55-7.24 23.55-38.36 0-45.6zM352.79 315.09c-28.53 8.76-52.84 3.92-65.59 0l-145.02-44.55L128 384c0 35.35 85.96 64 192 64s192-28.65 192-64l-14.18-113.47-145.03 44.56z"/></svg> (Katrien) Professor in insurance data science <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 640 512"><path d="M622.34 153.2L343.4 67.5c-15.2-4.67-31.6-4.67-46.79 0L17.66 153.2c-23.54 7.23-23.54 38.36 0 45.59l48.63 14.94c-10.67 13.19-17.23 29.28-17.88 46.9C38.78 266.15 32 276.11 32 288c0 10.78 5.68 19.85 13.86 25.65L20.33 428.53C18.11 438.52 25.71 448 35.94 448h56.11c10.24 0 17.84-9.48 15.62-19.47L82.14 313.65C90.32 307.85 96 298.78 96 288c0-11.57-6.47-21.25-15.66-26.87.76-15.02 8.44-28.3 20.69-36.72L296.6 284.5c9.06 2.78 26.44 6.25 46.79 0l278.95-85.7c23.55-7.24 23.55-38.36 0-45.6zM352.79 315.09c-28.53 8.76-52.84 3.92-65.59 0l-145.02-44.55L128 384c0 35.35 85.96 64 192 64s192-28.65 192-64l-14.18-113.47-145.03 44.56z"/></svg> (Roel H.) PhD student in insurance data science <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 640 512"><path d="M622.34 153.2L343.4 67.5c-15.2-4.67-31.6-4.67-46.79 0L17.66 153.2c-23.54 7.23-23.54 38.36 0 45.59l48.63 14.94c-10.67 13.19-17.23 29.28-17.88 46.9C38.78 266.15 32 276.11 32 288c0 10.78 5.68 19.85 13.86 25.65L20.33 428.53C18.11 438.52 25.71 448 35.94 448h56.11c10.24 0 17.84-9.48 15.62-19.47L82.14 313.65C90.32 307.85 96 298.78 96 288c0-11.57-6.47-21.25-15.66-26.87.76-15.02 8.44-28.3 20.69-36.72L296.6 284.5c9.06 2.78 26.44 6.25 46.79 0l278.95-85.7c23.55-7.24 23.55-38.36 0-45.6zM352.79 315.09c-28.53 8.76-52.84 3.92-65.59 0l-145.02-44.55L128 384c0 35.35 85.96 64 192 64s192-28.65 192-64l-14.18-113.47-145.03 44.56z"/></svg> (Roel V.) PhD, now with Finity Consulting in Sydney (Australia) --- name: checklist # Checklist ☑ Do you have a fairly recent version of R? ```r version$version.string ``` ☑ Do you have a fairly recent version of RStudio? ```r RStudio.Version()$version ## Requires an interactive session but should return something like "[1] ‘1.2.5001’" ``` ☑ Have you installed the R packages listed in the software requirements? --- class: inverse, center, middle name: universe # What's out there - the R universe --- name: why-R # Why R and RStudio? .center[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/indeeddotcom-1.svg" style="display: block; margin: auto;" /> ] <br> .footnote[This graph is created from the search results obtained via [www.indeed.com](https://www.indeed.com) (on Jan 12, 2020), using Grant McDermott's code for `ggplot2`, see lecture 1 in his [Data science for economists course](http://github.com/uo-ec607/lectures).] --- # Why R and RStudio? (cont.) 
### Data science positivism - Next to Python, R has become the *de facto* language for data science, with a cutting edge *machine learning toolbox*. - See: [The Popularity of Data Science Software](http://r4stats.com/articles/popularity/) - R is open-source with a very active community of users spanning academia and industry. -- ### Bridge to actuarial science, econometrics and other tools - R has all of the statistics and econometrics support, and is amazingly adaptable as a “glue” language to other programming languages and APIs. - R does not try to be everything to everyone. The RStudio IDE and ecosystem allow for further, seemless integration (with e.g. python, keras, tensorflow or C). - Widely used in actuarial undergraduate programs -- ### Disclaimer + Read more - It's also the language that we know best. - If you want to read more: [R-vs-Python](https://blog.rstudio.com/2019/12/17/r-vs-python-what-s-the-best-for-language-for-data-science/), [when to use Python or R](https://www.datacamp.com/community/blog/when-to-use-python-or-r) or [Hadley Wickham on the future of R](https://qz.com/1661487/hadley-wickham-on-the-future-of-r-python-and-the-tidyverse/) --- class: clear, center, middle background-image: url("img/tidyverse2.1.png") background-size: cover background-size: 65% background-position: center --- # Welcome to the tidyverse! ><p align="justify">The <b>tidyverse</b> is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. </p> <center> <img src="img/tidyverse_wide.png" width="750"/> </center> More on: [tidyverse](www.tidyverse.org). Install the packages with `install.packages("tidyverse")`. Then run `library(tidyverse)` to load the core tidyverse. --- # Principles of tidy data Three interrelated rules from the [R for data science](https://r4ds.had.co.nz/) book by Garrett Grolemund and Hadley Wickham: 1. Each variable must have its own column. 2. Each observation must have its own row. 3. Each value must have its own cell. <center> <img src="img/tidy_data.png" width="750"/> </center> .footnote[This figure is taken from Chapter 12 on Tidy data in [R for data science](https://r4ds.had.co.nz/).] --- # Workflow of a data scientist Here is a model of the .hi-pink[tools needed in a typical data science project]: > <p align="justify"> Together, tidying and transforming are called <b>wrangling</b>, because getting your data in a form that’s natural to work with often feels like a fight! </p> > <p align="justify"> Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally <b>mathematical or computational tool</b>, so they generally scale well. But every model makes <b>assumptions</b>, and by its very nature a model cannot question its own assumptions. That means <b>a model cannot fundamentally surprise you</b>.</p> <center> <img src="img/data_science_pipeline.png" width="600"/> </center> .footnote[Figure and quote taken from Chapter 1 in [R for data science](https://r4ds.had.co.nz/).] --- name: workshop-outline # Today's Outline .pull-left[ * [Prologue](#prologue) * [What's out there: the R universe](#universe) - why R and RStudio? - welcome to the tidyverse! 
- principles of tidy data - workflow of a data scientist * [Data wrangling and visualisation](#wrangling) - tibbles - Pipe operator `%>%` - {dplyr} instructions - {ggplot2} ] .pull-right[ * [Data set used in the workshop](#data-sets) - exploratory data analysis, including maps with {sf}, {tmap} * [Model building](#model-building) - basic concepts of `glm` and `mgcv::gam` for GAMs - smooth effect of a continuous risk factor, a spatial smoother - smooth effect of an interaction between two continuous risk factors * [From GAM to GLM](#from-gam-to glm) - binning the smooth spatial effect using {classInt} - binning a smoother of a continuous risk factor using {evtree} - putting it all together. ] --- class: inverse, center, middle name: wrangling # Preliminaries on data wrangling and visualisation --- # A tibble instead of a data.frame <img src="img/tibble.png" class="title-hex"> Within the tidyverse `tibble` is a modern take on a `data.frame`: - keep the features that have stood the test of time - drop the features that used to be convenient but are now frustrating. You can use: - `tibble()` to create a new tibble - `as_tibble()` transforms an object (e.g. a data frame) into a tibble. Quick example: explore the differences! ```r mtcars # install.packages("tidyverse") library(tidyverse) as_tibble(mtcars) ``` --- # Chains with the pipe operator <img src="img/pipe.png" class="title-hex"> In R, the pipe operator is `%>%`. It takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”; with this operator it becomes easy to chain a sequence of calculations. For example, when you have an input data and want to call functions `foo` and `bar` in sequence, you can write `data %>% foo %>% bar`. -- A first example: - take the `diamonds` data (from the {ggplot2} package) - then subset ```r diamonds %>% filter(cut == "Ideal") ``` -- Some excellent blog posts about this operator: [Pipes in R tutorial for beginners](https://www.datacamp.com/community/tutorials/pipe-r-tutorial) and [how to write this in base R](https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec). --- # Data manipulation verbs <img src="img/dplyr.png" class="title-hex"> The {dplyr} package holds many useful data manipulation verbs: - `mutate()` adds new variables that are functions of existing variables - `select()` picks variables based on their names - `filter()` picks cases based on their values - `summarise()` reduces multiple values down to a single summary - `arrange()` changes the ordering of the rows. These all combine naturally with `group_by()` which allows you to perform any operation “by group”. -- A first example: ```r diamonds %>% mutate(price_per_carat = price/carat) %>% filter(price_per_carat > 1500) ``` or ```r diamonds %>% group_by(cut) %>% summarize(price = mean(price), carat = mean(carat)) ``` --- # Plots with ggplot2 <img src="img/ggplot2.png" class="title-hex"> The aim of the {ggplot2} package is to create elegant data visualisations using the .hi-pink[grammar of graphics]. -- Here are the basic steps: - begin a plot with the function `ggplot()` creating a coordinate system that you can add layers to - the first argument of `ggplot()` is the dataset to use in the graph -- A first example ```r library(ggplot2) ggplot(data = mpg) ggplot(mpg) ``` creates an empty graph. You will now add layers to this graph! 
--- # Plots with ggplot2 <img src="img/ggplot2.png" class="title-hex"> You complete your graph by adding one or more .hi-pink[layers] to `ggplot()`. -- For example: - `geom_point()` adds a layer of points to your plot, which creates a scatterplot - `geom_smooth()` adds a smooth line - `geom_bar` a bar plot and many more, see [ggplot2 documentation](https://ggplot2.tidyverse.org/). -- Each `geom` function in `ggplot2` takes an aesthetic mapping argument: - maps variables in your dataset to visual properties - always paired with `aes()` and the `\(x\)` and `\(y\)` arguments of `aes()` specify which variables to map to the `\(x\)` and `\(y\)` axes. --- class: clear .pull-left[ ```r library(ggplot2) ggplot(mpg, `aes(displ, hwy, colour = class)`) + `geom_point()` + `theme_bw()` ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-9-1.svg" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ Extend the empty graph now with (here: global) aesthetic mapping argument `aes(displ, hwy, colour = class)`. This implies: `displ` on the x-axis, `hwy` on the y-axis and `class` to differentiate the color of the plotting symbol. With `geom_point` you add a layer of points to the empty graph. `theme_bw()` changes the `ggtheme` to a simple black-and-white theme. ] --- class: inverse, center, middle name: data-sets # Data set used in the workshop <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: data-sets-used # Data set used in this course - MTPL <img src="img/pipe.png" class="title-hex"> <img src="img/dplyr.png" class="title-hex"> <img src="img/ggplot2.png" class="title-hex"> We will use the Motor Third Party Liability data set. There are 163,231 policyholders in this data set. The frequency of claiming (`nclaims`) and corresponding severity (`avg`, the amount paid on average per claim reported by a policyholder) are the .KULbginline[target variables] in this data set. Predictor variables are: * the exposure-to-risk, the duration of the insurance coverage (max. 1 year) * factor variables, e.g. gender, coverage, fuel * continuous, numeric variables, e.g. age of the policyholder, age of the car * spatial information: postal code (in Belgium) of the municipality where the policyholder resides. More details in [Henckaerts et al. (2018, Scandinavian Actuarial Journal)](https://katrienantonio.github.io/projects/2019/06/13/machine-learning/#data-driven) and [Henckaerts et al. (2019, arxiv)](https://katrienantonio.github.io/projects/2019/06/13/machine-learning/#tree-based-pricing). --- name: data-sets-used # Data set used in this course - MTPL <img src="img/pipe.png" class="title-hex"> <img src="img/dplyr.png" class="title-hex"> <img src="img/ggplot2.png" class="title-hex"> You can load the data from a script in the `scripts` folder as follows: ```r # install.packages("rstudioapi") dir <- dirname(rstudioapi::getActiveDocumentContext()$path) setwd(dir) mtpl_orig <- read.table('./data/P&Cdata.txt', header = TRUE) mtpl_orig <- as_tibble(mtpl_orig) ``` Some basic exploratory steps with this data follow on the next sheet. 
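As a quick sanity check right after loading, you can inspect the dimensions and variable types. A minimal sketch (assuming the loading code above has been run and the tidyverse is loaded):

```r
# quick structural overview of the raw data
dim(mtpl_orig)       # number of policies (rows) and variables (columns)
glimpse(mtpl_orig)   # variable names, types and first values (glimpse() comes with the tidyverse)
```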
--- name: data-sets-used # Data set used in this course - MTPL <img src="img/pipe.png" class="title-hex"> <img src="img/dplyr.png" class="title-hex"> <img src="img/ggplot2.png" class="title-hex"> Note that the data `mtpl_orig` uses capitals for the variable names ```r mtpl_orig %>% slice(1:3) %>% select(-LONG, -LAT) %>% kable(format = 'html') ``` <table> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> NCLAIMS </th> <th style="text-align:right;"> AMOUNT </th> <th style="text-align:right;"> AVG </th> <th style="text-align:right;"> EXP </th> <th style="text-align:left;"> COVERAGE </th> <th style="text-align:left;"> FUEL </th> <th style="text-align:left;"> USE </th> <th style="text-align:left;"> FLEET </th> <th style="text-align:left;"> SEX </th> <th style="text-align:right;"> AGEPH </th> <th style="text-align:right;"> BM </th> <th style="text-align:right;"> AGEC </th> <th style="text-align:right;"> POWER </th> <th style="text-align:right;"> PC </th> <th style="text-align:left;"> TOWN </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1618.001 </td> <td style="text-align:right;"> 1618.001 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 77 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> PO </td> <td style="text-align:left;"> gasoline </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 66 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TPL </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:left;"> N </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 1000 </td> <td style="text-align:left;"> BRUSSEL </td> </tr> </tbody> </table> We change this to lower case variables, and rename `exp` to `expo`. ```r mtpl <- mtpl_orig %>% # rename all columns rename_all(function(.name) { .name %>% # replace all names with the lowercase versions tolower }) mtpl <- rename(mtpl, expo = exp) ``` --- class: clear name: first-steps-MTPL .pull-left[ Let's explore different ways to calculate the .KULbginline[empirical claim frequency]. 
Inspect the following instructions and list pros and cons: ```r mean(mtpl$nclaims) sum(mtpl$nclaims)/sum(mtpl$expo) mtpl %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` ``` ## [1] 0.1239715 ## [1] 0.1393352 ``` <table> <thead> <tr> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.1393352 </td> </tr> </tbody> </table> ] .pull-right[ What about the .KULbginline[empirical variance]? ```r m <- sum(mtpl$nclaims)/sum(mtpl$expo) m ## [1] 0.1393352 var <- sum((mtpl$nclaims - m * mtpl$expo)^2)/sum(mtpl$expo) var ## [1] 0.1517246 ``` Here we use the expression for the variance: $$\text{Var}(X) = \text{E}[(X-\text{E}X)^2] $$ We empirically calculate the first expected value by taking exposure into account. Moreover, we compare each realization of `nclaims` with its expected value, i.e. `m * expo` where `m` is the global empirical claim frequency. ] --- class: clear .pull-left[ ```r dim(mtpl) ``` ``` ## [1] 163231 18 ``` ```r mtpl %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` <table> <thead> <tr> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.1393352 </td> </tr> </tbody> </table> ```r mtpl %>% group_by(sex) %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` <table> <thead> <tr> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 0.1484325 </td> </tr> <tr> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0.1361164 </td> </tr> </tbody> </table> ] .pull-right[ ```r KULbg <- "#116E8A" g <- ggplot(mtpl, aes(nclaims)) + theme_bw() + geom_bar(col = KULbg, fill = KULbg) + labs(y = "Abs frequency") + ggtitle("MTPL - number of claims") g ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-21-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ ```r KULbg <- "#116E8A" g <- ggplot(mtpl, aes(nclaims)) + theme_bw() + geom_bar(aes(weight = expo), col = KULbg, fill = KULbg) + labs(y = "Abs freq (in exposure)") + ggtitle("MTPL - number of claims") g ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-22-1.svg" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ ```r g <- ggplot(mtpl, aes(nclaims)) + theme_bw() g + geom_bar(aes(y = (..count..)/sum(..count..)), col = KULbg, fill = KULbg) + labs(y = "Relative frequency") + ggtitle("MTPL - relative number of claims") ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-24-1.svg" width="80%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ We now step from the barplot to a histogram (in {ggplot2}), see [ggplot2 histogram](http://ggplot2.tidyverse.org/reference/geom_histogram.html). 
Here is the histogram of `bm` showing how the policyholders are distributed over levels in the bonus-malus scale: ```r g <- ggplot(mtpl, aes(bm)) + theme_bw() g + geom_histogram(binwidth = 1, col = KULbg, fill = KULbg, alpha = .5) ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-26-1.svg" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ For the relative frequency histogram ```r g <- ggplot(mtpl, aes(bm)) + theme_bw() g + geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = 1, col = KULbg, fill = KULbg, alpha = .5) + labs(y = "Relative frequency") ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-28-1.svg" width="70%" style="display: block; margin: auto;" /> ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ To get warmed up, let's load the `mtpl` data and do some .KULbginline[basic investigations] into the variables. The idea is to get a feel for the data. Your starting point are the instructions in the R script on data explorations from the [course material](https://github.com/katrienantonio/PE-pricing-analytics). .hi-pink[Q]: you will work through the following exploratory steps. 1. Visualize the distribution of the `ageph` with a histogram. 2. For each age recorded in the data set `mtpl`: what is the total number of observations, the total exposure, and the corresponding total number of claims reported? 3. Calculate the empirical claim frequency, per unit of exposure, for each age and picture it. Discuss this figure. 4. Repeat the above for `bm`, the level occupied by the policyholder in the Belgian bonus-malus scale. <br>
10:00
] --- class: clear .pull-left[ For .hi-pink[Q.1] a histogram of `ageph` ```r ggplot(data = mtpl, aes(ageph)) + theme_bw() + geom_histogram(binwidth = 2, col = KULbg, fill = KULbg, alpha = .5) + labs(y = "Absolute frequency") + ggtitle("MTPL - age policyholder") ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-30-1.svg" width="75%" style="display: block; margin: auto;" /> ] .pull-right[ For .hi-pink[Q.2] for each `ageph` recorded ```r mtpl %>% group_by(ageph) %>% summarize(tot_claims = sum(nclaims), tot_expo = sum(expo), tot_obs = n()) ``` <table> <thead> <tr> <th style="text-align:right;"> ageph </th> <th style="text-align:right;"> tot_claims </th> <th style="text-align:right;"> tot_expo </th> <th style="text-align:right;"> tot_obs </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4.621918 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 28 </td> <td style="text-align:right;"> 93.021918 </td> <td style="text-align:right;"> 116 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 113 </td> <td style="text-align:right;"> 342.284932 </td> <td style="text-align:right;"> 393 </td> </tr> <tr> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 165 </td> <td style="text-align:right;"> 597.389041 </td> <td style="text-align:right;"> 701 </td> </tr> <tr> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 202 </td> <td style="text-align:right;"> 778.827397 </td> <td style="text-align:right;"> 952 </td> </tr> <tr> <td style="text-align:right;"> 23 </td> <td style="text-align:right;"> 297 </td> <td style="text-align:right;"> 1165.358904 </td> <td style="text-align:right;"> 1379 </td> </tr> <tr> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 426 </td> <td style="text-align:right;"> 1752.249315 </td> <td style="text-align:right;"> 2028 </td> </tr> <tr> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 546 </td> <td style="text-align:right;"> 2343.504110 </td> <td style="text-align:right;"> 2673 </td> </tr> </tbody> </table> ] --- class: clear For .hi-pink[Q.3] for each `ageph` recorded .pull-left[ ```r freq_by_age <- mtpl %>% group_by(ageph) %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ggplot(freq_by_age, aes(x = ageph, y = emp_freq)) + theme_bw() + geom_bar(stat = "identity", color = KULbg, fill = KULbg, alpha = .5) + ggtitle("MTPL - empirical claim freq per age policyholder") ``` For .hi-pink[Q.4] recycle the above instructions and replace `ageph` with `bm`. ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-34-1.svg" width="85%" style="display: block; margin: auto;" /> ] --- class: clear To reproduce the graphs from the paper, we write functions as a wrapper for the instructions used above. 
```r # or, use the colors from our paper col <- KULbg fill <- KULbg ylab <- "Relative frequency" # wrapper functions ggplot.bar <- function(DT, variable, xlab){ ggplot(data = DT, aes(as.factor(variable))) + theme_bw() + geom_bar(aes(y = (..count..)/sum(..count..)), col = col, fill = fill, alpha = 0.5) + labs(x = xlab, y = ylab) } ggplot.hist <- function(DT, variable, xlab, binwidth){ ggplot(data = DT, aes(variable)) + theme_bw() + geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = binwidth, col = col, fill = fill, alpha = 0.5) + labs(x = xlab, y = ylab) } ``` Let's give these a try! --- class: clear <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-36-1.svg" style="display: block; margin: auto;" /> --- # Spatial data with {rgdal} With the {rgdal} package we load the .hi-pink[shapefile] of Belgium at postal code level as follows: ```r library(rgdal) readShapefile = function(){ belgium_shape <- readOGR(dsn = path.expand(paste(getwd(), "./shape file Belgie postcodes", sep = "")), layer = "npc96_region_Project1") belgium_shape <- spTransform(belgium_shape, CRS("+proj=longlat +datum=WGS84")) belgium_shape$id <- row.names(belgium_shape) return(belgium_shape) } belgium_shape = readShapefile() ## OGR data source with driver: ESRI Shapefile ## Source: "C:\Users\u0043788\Dropbox\Data science for non-life insurance\Computer labs\Pricing analytics in xaringan\shape file Belgie postcodes", layer: "npc96_region_Project1" ## with 1146 features ## It has 6 fields class(belgium_shape) ## [1] "SpatialPolygonsDataFrame" ## attr(,"package") ## [1] "sp" ``` --- # Spatial data with {rgdal} With the {rgdal} package we load the .hi-pink[shapefile] of Belgium at postal code level as follows. Attributes or variables are stored as a data frame in the `@data` of the `SpatialPolygonsDataFrame`. You can access it as follows. ```r str(belgium_shape@data) ## 'data.frame': 1146 obs. of 7 variables: ## $ POSTCODE : int 1300 1301 1310 1315 1320 1325 1330 1331 1332 1340 ... ## $ NAAM1 : Factor w/ 22 levels "ANDERLECHT","BRUSSEL",..: NA NA NA NA NA NA NA NA NA NA ... ## $ NAAM2 : Factor w/ 1138 levels "'s Gravenvoeren",..: 1070 118 561 499 82 204 844 854 358 778 ... ## $ POSTCODELA: Factor w/ 1146 levels "1000","1020",..: 23 24 25 26 27 28 29 30 31 32 ... ## $ Shape_Leng: num 42814 15954 20276 43081 29648 ... ## $ Shape_Area: num 32629722 9641834 15381287 38638405 38812269 ... ## $ id : chr "0" "1" "2" "3" ... ``` --- class: clear .pull-left[ Plotting just an empty map of Belgium, i.e. postcal code areas not filled with any color then goes as follows: ```r plot.eda.map <- ggplot(belgium_shape, aes(long, lat, group = group)) + geom_polygon(fill = NA, colour = "black", size = 0.1) + theme_bw() plot.eda.map ``` Under the hood, the `ggplot` instruction transforms the shapefile into a data frame. This is what the `fortify` function does, e.g. ```r fortify(belgium_shape) %>% slice(1:3) ## long lat order hole piece id group ## 1 4.550931 50.74536 1 FALSE 1 0 0.1 ## 2 4.552253 50.74755 2 FALSE 1 0 0.1 ## 3 4.555582 50.74903 3 FALSE 1 0 0.1 ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-41-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ A quick, interactive look using `mapview` from the {mapview} package. This package provides functions to very quickly and conveniently create interactive visualisations of spatial data. ```r mapview(belgium_shape) ``` ] .pull-right[ ```r mapview(belgium_shape) ```
] --- # Spatial data with {sf} Alternatively, we can use the {sf} package to load the .hi-pink[shapefile] of Belgium at postal code level as follows: ```r library(sf) belgium_shape_sf <- st_read('./shape file Belgie postcodes/npc96_region_Project1.shp', quiet = TRUE) belgium_shape_sf <- st_transform(belgium_shape_sf, "+proj=longlat +datum=WGS84") ``` We inspect the resulting object: ```r class(belgium_shape_sf) ## [1] "sf" "data.frame" belgium_shape_sf %>% as_tibble() %>% select(-geometry) %>% slice(1:3) %>% kable(format = 'html') ``` <table> <thead> <tr> <th style="text-align:right;"> POSTCODE </th> <th style="text-align:left;"> NAAM1 </th> <th style="text-align:left;"> NAAM2 </th> <th style="text-align:left;"> POSTCODELA </th> <th style="text-align:right;"> Shape_Leng </th> <th style="text-align:right;"> Shape_Area </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1300 </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Wavre </td> <td style="text-align:left;"> 1300 </td> <td style="text-align:right;"> 42814.06 </td> <td style="text-align:right;"> 32629722 </td> </tr> <tr> <td style="text-align:right;"> 1301 </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Bierges </td> <td style="text-align:left;"> 1301 </td> <td style="text-align:right;"> 15954.49 </td> <td style="text-align:right;"> 9641834 </td> </tr> <tr> <td style="text-align:right;"> 1310 </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> La Hulpe </td> <td style="text-align:left;"> 1310 </td> <td style="text-align:right;"> 20276.46 </td> <td style="text-align:right;"> 15381287 </td> </tr> </tbody> </table> --- class: clear .pull-left[ With the packages {sp} and {rgdal}, spatial objects had to be converted to data frames (e.g. with `fortify()`) prior to plotting with {ggplot2}. However, with {sf} the objects are already data frames and they can be plotted with the help of the new `geom_sf()` layer in {ggplot2}. This offers huge advantages! ```r ggplot(belgium_shape_sf) + geom_sf() + ggtitle("Welcome to Belgium!") + theme_bw() ``` Have a look at [Tidy spatial data in R](http://strimas.com/r/tidy-sf/) for more details. ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-47-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ We can again build the interactive map with `mapview` from the object `belgium_shape_sf`. Give it a try! Now we can also build nice static, thematic maps using the {tmap} package. ```r library(tmap) # qtm(belgium_shape_sf) # does not work # shapefile slightly corrupted! # slightly smooth the shapefile simple_shp <- st_simplify(belgium_shape_sf, dTolerance = 0.00001) # and plot qtm(simple_shp) ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-49-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ We can also play with these maps ```r tm_shape(simple_shp) + tm_borders(col = KULbg, lwd = 0.5) + tm_layout(main.title = 'Welcome to Belgium!', legend.outside = TRUE, frame = FALSE) ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-51-1.svg" style="display: block; margin: auto;" /> ] --- # Summary statistics at postal code level Our goal is now to calculate summary statistics at postal code, followed by a visualisation of these summary stats using maps. For example, what is the total exposure observed per postal code? 
```r post_expo <- mtpl %>% group_by(pc) %>% summarize(num = n(), total_expo = sum(expo)) post_expo %>% slice(1:5) %>% kable(format = 'html') ``` <table> <thead> <tr> <th style="text-align:right;"> pc </th> <th style="text-align:right;"> num </th> <th style="text-align:right;"> total_expo </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1000 </td> <td style="text-align:right;"> 1171 </td> <td style="text-align:right;"> 961.3178 </td> </tr> <tr> <td style="text-align:right;"> 1030 </td> <td style="text-align:right;"> 928 </td> <td style="text-align:right;"> 738.3397 </td> </tr> <tr> <td style="text-align:right;"> 1040 </td> <td style="text-align:right;"> 433 </td> <td style="text-align:right;"> 357.2904 </td> </tr> <tr> <td style="text-align:right;"> 1050 </td> <td style="text-align:right;"> 643 </td> <td style="text-align:right;"> 524.6164 </td> </tr> <tr> <td style="text-align:right;"> 1060 </td> <td style="text-align:right;"> 344 </td> <td style="text-align:right;"> 279.9452 </td> </tr> </tbody> </table> We recorded (non-zero) exposures for 583 of the 1,146 postal codes in Belgium. --- class: clear As a first step, we take the {rgdal} approach and use the `SpatialPolygonsDataFrame` stored in `belgium_shape`. We merge the summary stats at postal code level with the data variable stored in `belgium_shape@data`. ```r library(dplyr) belgium_shape@data <- left_join(belgium_shape@data, post_expo, by = c("POSTCODE" = "pc")) belgium_shape@data %>% slice(1:3) %>% kable(format = 'html') ``` <table> <thead> <tr> <th style="text-align:right;"> POSTCODE </th> <th style="text-align:left;"> NAAM1 </th> <th style="text-align:left;"> NAAM2 </th> <th style="text-align:left;"> POSTCODELA </th> <th style="text-align:right;"> Shape_Leng </th> <th style="text-align:right;"> Shape_Area </th> <th style="text-align:left;"> id </th> <th style="text-align:right;"> num </th> <th style="text-align:right;"> total_expo </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1300 </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Wavre </td> <td style="text-align:left;"> 1300 </td> <td style="text-align:right;"> 42814.06 </td> <td style="text-align:right;"> 32629722 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:right;"> 477 </td> <td style="text-align:right;"> 383.87397 </td> </tr> <tr> <td style="text-align:right;"> 1301 </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Bierges </td> <td style="text-align:left;"> 1301 </td> <td style="text-align:right;"> 15954.49 </td> <td style="text-align:right;"> 9641834 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;"> 1310 </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> La Hulpe </td> <td style="text-align:left;"> 1310 </td> <td style="text-align:right;"> 20276.46 </td> <td style="text-align:right;"> 15381287 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 109 </td> <td style="text-align:right;"> 90.90411 </td> </tr> </tbody> </table> and a quick check to verify our `left_join` ```r post_expo %>% filter(pc == 1301) %>% kable(format = 'html') ``` <table> <thead> <tr> <th style="text-align:right;"> pc </th> <th style="text-align:right;"> num </th> <th style="text-align:right;"> total_expo </th> </tr> </thead> <tbody> <tr> </tr> </tbody> </table> --- class: clear Next, we calculate the relative exposure per area unit. 
```r belgium_shape@data$freq <- belgium_shape@data$total_expo/belgium_shape@data$Shape_Area ``` and we transform this continuous variable to a binned version using `cut()` ```r belgium_shape@data$freq_class <- cut(belgium_shape@data$freq, breaks = quantile(belgium_shape@data$freq, c(0,0.2,0.8,1), na.rm = TRUE), right = FALSE, include.lowest = TRUE, labels = c("low","average","high")) belgium_shape@data %>% slice(1:3) %>% kable(format = 'html') ``` <table> <thead> <tr> <th style="text-align:right;"> POSTCODE </th> <th style="text-align:left;"> NAAM1 </th> <th style="text-align:left;"> NAAM2 </th> <th style="text-align:left;"> POSTCODELA </th> <th style="text-align:right;"> Shape_Leng </th> <th style="text-align:right;"> Shape_Area </th> <th style="text-align:left;"> id </th> <th style="text-align:right;"> num </th> <th style="text-align:right;"> total_expo </th> <th style="text-align:right;"> freq </th> <th style="text-align:left;"> freq_class </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1300 </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Wavre </td> <td style="text-align:left;"> 1300 </td> <td style="text-align:right;"> 42814.06 </td> <td style="text-align:right;"> 32629722 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:right;"> 477 </td> <td style="text-align:right;"> 383.87397 </td> <td style="text-align:right;"> 1.18e-05 </td> <td style="text-align:left;"> average </td> </tr> <tr> <td style="text-align:right;"> 1301 </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Bierges </td> <td style="text-align:left;"> 1301 </td> <td style="text-align:right;"> 15954.49 </td> <td style="text-align:right;"> 9641834 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:right;"> 1310 </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> La Hulpe </td> <td style="text-align:left;"> 1310 </td> <td style="text-align:right;"> 20276.46 </td> <td style="text-align:right;"> 15381287 </td> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 109 </td> <td style="text-align:right;"> 90.90411 </td> <td style="text-align:right;"> 5.90e-06 </td> <td style="text-align:left;"> average </td> </tr> </tbody> </table> --- class: clear .pull-left[ Finally, to use this with `ggplot` we need `fortify` to transform the original shape file to a data frame ```r belgium_shape_f <- fortify(belgium_shape) ``` We merge the resulting data frame `belgium_shape_f` with `belgium_shape@data` ```r belgium_shape_f <- left_join(belgium_shape_f, belgium_shape@data) ``` and we use the result to plot the map with `ggplot` ```r plot.eda.map <- ggplot(belgium_shape_f, aes(long, lat, group = group)) + geom_polygon(aes(fill = belgium_shape_f$freq_class), colour = "black", size = 0.1) plot.eda.map <- plot.eda.map + theme_bw() + labs(fill = "Relative\nfrequency") + scale_fill_brewer(palette = "Blues", na.value = "white") plot.eda.map ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-60-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ 
Alternatively, you can merge the `belgium_shape_sf` object with the summary at postal code level stored in `post_expo`. .hi-pink[Q]: you will work through the following steps. 1. Return to the object stored in `belgium_shape_sf`, which is both of class `sf` and a data frame. 2. Merge this data frame with the `post_expo` data. 3. Calculate the relative exposure per area unit and discretize it into 3 groups (low - average - high). 4. Colour each Belgian municipality according to its class (NA - low - average - high), using `ggplot` and {tmap}. <br>
10:00
] --- class: clear Alternatively, we merge the `belgium_shape_sf` object with the constructed summary at postal code level, stored in `post_expo`. ```r belgium_shape_sf <- left_join(belgium_shape_sf, post_expo, by = c("POSTCODE" = "pc")) ``` We compute the relative exposure per area unit. ```r belgium_shape_sf$freq <- belgium_shape_sf$total_expo/belgium_shape_sf$Shape_Area ``` And we transform this continuous variable to a binned version using `cut()` ```r belgium_shape_sf$freq_class <- cut(belgium_shape_sf$freq, breaks = quantile(belgium_shape_sf$freq, c(0,0.2,0.8,1), na.rm = TRUE), right = FALSE, include.lowest = TRUE, labels = c("low", "average", "high")) ``` --- class: clear The resulting map can then easily be displayed with {ggplot} or with {tmap} instructions, starting from the `belgium_shape_sf` object. .pull-left[ ```r ggplot(belgium_shape_sf) + geom_sf(aes(fill = freq_class), colour = "black", size = 0.1) + ggtitle("MTPL claim frequency data") + labs(fill = "Relative\nexposure") + scale_fill_brewer(palette = "Blues", na.value = "white") + theme_bw() ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-65-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ ```r library(tmap) # slightly smooth the shapefile belgium_shape_sf <- st_simplify(belgium_shape_sf, dTolerance = 0.00001) # and plot tm_shape(belgium_shape_sf) + tm_borders(col = "black") + tm_fill(col = "freq_class", style = "cont", palette = "Blues", colorNA = "white") ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-67-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ Finally, for interactive maps we can use e.g. ```r tmap_leaflet(last_map()) ``` ] .pull-right[
] --- class: inverse, center, middle name: model-building # Model building with GLMs and GAMs <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Linear and Generalized Linear Models .pull-left-alt[ .center[ <img src="img/esbjorn_GLM.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> <br> <br> <img src="img/de_jong_GLM.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> ] ] .pull-right-alt[ With .hi-pink[linear regression models] `lm(.)` - model specification `$$\begin{eqnarray*} \color{#FFA500}{Y} = \color{#e64173}{x}^{'}\color{#20B2AA}{\beta} + \epsilon. \end{eqnarray*}$$` - `\(\epsilon\)` is normally distributed with mean 0 and common variance `\(\sigma^2\)`, thus: `\(\color{#FFA500}{Y}\)` is normal with mean `\(\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}\)` and variance `\(\sigma^2\)` With .hi-pink[generalized linear regression models] `glm(.)` - model specification `$$\begin{eqnarray*} g(E[\color{#FFA500}{Y}]) = \color{#e64173}{x}^{'}\color{#20B2AA}{\beta}. \end{eqnarray*}$$` - `\(g(.)\)` is the link function - `\(\color{#FFA500}{Y}\)` follows a distribution from the exponential family. ] --- # Generalized Linear Models (GLMs) We return to the `mtpl` data set. .pull-left[ Target variable `nclaims` (frequency) <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-70-1.svg" width="65%" style="display: block; margin: auto;" /> Suitable distributions: Poisson, Negative Binomial. ] .pull-right[ ... and `avg` (severity). <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-71-1.svg" width="65%" style="display: block; margin: auto;" /> Suitable distributions: log-normal, gamma. ] --- # A Poisson GLM .pull-left[ A brief recap.. ```r freq_by_gender <- mtpl %>% group_by(sex) %>% summarize(emp_freq = sum(nclaims) / sum(expo)) ``` <table> <thead> <tr> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> emp_freq </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 0.1484325 </td> </tr> <tr> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0.1361164 </td> </tr> </tbody> </table> Let's picture the empirical gender-specific claim frequency... ```r ggplot(freq_by_gender, aes(x = sex, y = emp_freq)) + geom_bar(col = KULbg, fill = KULbg, alpha = .5) ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-75-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # A Poisson GLM (cont.) .pull-left[ ```r freq_glm_1 <- `glm`(nclaims ~ sex, offset = log(expo), `family = poisson(link = "log")`, `data = mtpl`) ``` ] .pull-right[ Fit a .KULbginline[Poisson GLM], with .KULbginline[logarithmic link] function. This implies: `\(\color{#FFA500}{Y}\)` ~ Poisson, with `$$\begin{eqnarray*} \log(E[\color{#FFA500}{Y}]) &=& \color{#e64173}{x}^{'}\color{#20B2AA}{\beta}, \end{eqnarray*}$$` or, `$$E[\color{#FFA500}{Y}] = \exp{(\color{#e64173}{x}^{'}\color{#20B2AA}{\beta})}.$$` Fit this model on `data = mtpl`. ] --- # A Poisson GLM (cont.) .pull-left[ ```r freq_glm_1 <- glm(`nclaims ~ sex`, `offset = log(expo)`, family = poisson(link = "log"), data = mtpl) ``` ] .pull-right[ Use `nclaims` as `\(\color{#FFA500}{Y}\)`. Use `gender` as the only (factor) variable in the linear predictor. Include `log(exp)` as an offset term in the linear predictor. 
Then, `$$\begin{eqnarray*} \color{#e64173}{x}^{'}\color{#20B2AA}{\beta} = \log{(\texttt{expo})}+\beta_0 + \beta_1 \mathbb{I}(\texttt{male}). \end{eqnarray*}$$` Put otherwise, `$$\begin{eqnarray*} E[\color{#FFA500}{Y}] = \texttt{expo} \cdot \exp{(\beta_0 + \beta_1 \mathbb{I}(\texttt{male}))} \end{eqnarray*},$$` where `\(\texttt{expo}\)` refers to `expo` the exposure variable. ] --- class: clear .pull-left[ ```r freq_glm_1 <- glm(nclaims ~ sex, `offset = log(expo)`, family = poisson(link = "log"), data = mtpl) ``` ```r freq_glm_1 %>% broom::tidy() ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.9076251 </td> <td style="text-align:right;"> 0.0133227 </td> <td style="text-align:right;"> -143.186324 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> sexmale </td> <td style="text-align:right;"> -0.0866198 </td> <td style="text-align:right;"> 0.0156837 </td> <td style="text-align:right;"> -5.522931 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> Mind the specification of `type.predict` when using `augment` with a GLM! ```r freq_glm_1 %>% broom::augment(type.predict = "response") ``` <table> <thead> <tr> <th style="text-align:right;"> nclaims </th> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> .fitted </th> <th style="text-align:right;"> .se.fit </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0.1361164 </td> <td style="text-align:right;"> 0.0011264 </td> </tr> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 0.1484325 </td> <td style="text-align:right;"> 0.0019775 </td> </tr> </tbody> </table> ] .pull-right[ The `predict` function of a GLM object offers 3 options: `"link"`, `"response"` or `"terms"`. The same options hold when `augment()` is applied to a GLM object. Let's see how the fitted values at `"response"` level are constructed: ```r exp(coef(freq_glm_1)[1]) ## (Intercept) ## 0.1484325 exp(coef(freq_glm_1)[1] + coef(freq_glm_1)[2]) ## (Intercept) ## 0.1361164 ``` Do you recognize these numbers? Last step: try `freq_glm_1 %>% glance()` or `summary(freq_glm_1)` for deviances. ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will further explore GLMs in R with the `glm(.)` function. .hi-pink[Q]: continue with the `freq_glm_1` object that was created, you will now explicitly call the `predict()` function on this object. 1. Verify the arguments of `predict.glm` using `? predict.glm`. 2. The help reveals the following structure `predict(.object, .newdata, type = ("..."))` where `.object` is the fitted GLM object, `.newdata` is (optionally) a data frame to look for the features used in the model, and `type` is `"link"`, `"response"` or `"terms"`. <br> Use `predict` with `freq_glm_1` and a newly created data frame. <br> Explore the different options for `type`, and their connections. 3. Fit a gamma GLM for `avg` (the claim severity) with log link. 
<br> Use `sex` as the only variable in the model. What do you conclude? ] --- class: clear .pull-left[ .hi-pink[Q.1] You can access the documentation via `? predict.glm`. .hi-pink[Q.2] You create new data frames (or tibbles) as follows ```r male_driver <- data.frame(exp = 1, sex = "male") female_driver <- data.frame(exp = 1, sex = "female") ``` Next, you apply `predict` with the GLM object `freq_glm_1` and one of these data frames, e.g. ```r predict(freq_glm_1, newdata = male_driver, type = "response") ``` ``` ## 1 ## 0.1361164 ``` ] .pull-right[ .hi-pink[Q.2] Next, you apply `predict` with the GLM object `freq_glm_1` and one of these data frames, e.g. ```r predict(freq_glm_1, newdata = male_driver, type = "response") ``` ``` ## 1 ## 0.1361164 ``` At the level of the linear predictor: ```r predict(freq_glm_1, newdata = male_driver, type = "link") ``` ``` ## 1 ## -1.994245 ``` ```r exp(predict(freq_glm_1, newdata = male_driver, type = "link")) ``` ``` ## 1 ## 0.1361164 ``` ] --- class: clear .hi-pink[Q.3] For the gamma regression model ```r sev_glm_1 <- glm(avg ~ sex, family = Gamma(link = "log"), data = mtpl) sev_glm_1 ``` ``` ## ## Call: glm(formula = avg ~ sex, family = Gamma(link = "log"), data = mtpl) ## ## Coefficients: ## (Intercept) sexmale ## 7.5730 -0.2581 ## ## Degrees of Freedom: 18294 Total (i.e. Null); 18293 Residual ## (144936 observations deleted due to missingness) ## Null Deviance: 46690 ## Residual Deviance: 46440 AIC: 299700 ``` --- # Generalized Additive Models (GAMs) .pull-left-alt[ .center[ <img src="img/GAM_Hastie_Tibshirani.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> <br> <br> <img src="img/GAM_Wood.jpg" alt="Drawing" style="width: 150px; height: 220px;"/> ] ] .pull-right-alt[ With .hi-pink[GLMs] `glm(.)` - transformation of the mean modelled with a linear predictor `$$\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}$$` - not well suited for continuous risk factors that relate to the response in a non-linear way. With .hi-pink[Generalized Additive Models (GAMs)] - the predictor allows for smooth effects of continuous risk factors and spatial covariates, next to the linear terms, e.g. `$$\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}+\sum_j f_j(x_j) + f(\texttt{lat}, \texttt{long})$$` - predictor is still additive - preferred R package is {mgcv} by Simon Wood. ] --- # More on GAMs .pull-left[ So, a GAM is a GLM where the linear predictor depends on .hi-pink[smooth functions] of covariates. Consider a GAM with the following predictor: `$$\color{#e64173}{x}^{'}\color{#20B2AA}{\beta}+f_j(x_j).$$` GAMs use .hi-pink[basis functions] to estimate the smooth effect `\(f_j(.)\)` `$$f_j(x_j) = \sum_{m=1}^M \beta_{jm} b_{jm}(x_j),$$` where the `\(b_{jm}(x)\)` are known basis functions and `\(\beta_{jm}\)` are coefficients that have to be estimated. ] .pull-right[ GAMs avoid overfitting by adding a .hi-pink[wiggliness penalty] to the likelihood `$$\int \left(f_j(x)^{''}\right)^{2} = \beta_j^t \mathbf{S}_j \beta_j.$$` GAMs then balance goodness-of-fit and wiggliness via `$$\log \mathcal{L}(\beta, \beta_j) - \lambda_j \cdot \beta_j^t \mathbf{S}_j \beta_j,$$` with `\(\lambda_j\)` the .hi-pink[smoothing parameter]. The smoothing parameter `\(\lambda_j\)` controls the trade-off between fit & smoothness. ] --- class: clear Let's run some experiments to illustrate the effect of the smoothing parameter (`sp = .`), the number (`k = .`) and type of basis functions (`bs = .`). We use the `mcycle` data from {MASS}. 
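The figure below summarises such experiments. A minimal sketch of how one such comparison could be set up (the specific `sp` values and `k = 20` are chosen purely for illustration; {MASS}, {mgcv} and {ggplot2} are assumed to be available):

```r
library(MASS)     # mcycle data
library(mgcv)     # gam()
library(ggplot2)

# fit the same smooth with three (arbitrary) smoothing parameters
fits <- lapply(c(0.0001, 1, 100), function(sp_val) {
  m <- gam(accel ~ s(times, k = 20, bs = "cr", sp = sp_val),
           family = gaussian, data = mcycle)
  data.frame(times = mcycle$times, fit = fitted(m), sp = sp_val)
})

# raw observations in grey, fitted smooth per value of sp
ggplot(do.call(rbind, fits), aes(times, fit)) + theme_bw() +
  geom_point(data = mcycle, mapping = aes(times, accel), alpha = .3) +
  geom_line(col = "#116E8A") +
  facet_wrap(~ sp, labeller = label_both)
```

Varying `k` or `bs` inside `s(times, ...)` in the same way illustrates the role of the number and type of basis functions.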
<img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/gam-smoothing-1.svg" width="100%" style="display: block; margin: auto;" /> --- # Fitting a GAM with {mgcv} .pull-left[ Fitting a GAM with {mgcv} - the Mixed GAM Computation Vehicle with automatic smoothness estimation - essentially goes as follows: ```r model <- gam(accel ~ `s(times, sp = 1.2,` `k = 5, bs = "cr")`, family = gaussian, data = mcycle) ``` ] .pull-right[ Include a smooth effect of `times` via `s(times)`. `sp = .` specifies a value for the smoothing parameter. `k = .` fixes the number of basis functions. `bs = "cr"` indicates which type of basis functions should be used. Here `"cr"` refers to the cubic spline basis. More options and details via [https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.terms.html](https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.terms.html). ] --- # Fitting a GAM with {mgcv} .pull-left[ Fitting a GAM with {mgcv} - the Mixed GAM Computation Vehicle with automatic smoothness estimation - essentially goes as follows: ```r model <- gam(accel ~ `s(times, sp = 1.2,` `k = 5, bs = "cr")`, family = gaussian, data = mcycle) ``` or like this ```r model <- gam(accel ~ s(times, bs = "cr"), `method = "REML"`, family = gaussian, data = mcycle) ``` ] .pull-right[ Include a smooth effect of `times` via `s(times)`. `sp = .` specifies a value for the smoothing parameter. `k = .` fixes the number of basis functions. `bs = "cr"` indicates which type of basis functions should be used. Here `"cr"` refers to the cubic spline basis. More options and details via [https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.terms.html](https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.terms.html). `method = "REML"` will determine an optimal value for the smoothing parameter, using the REML method. More details via [https://www.rdocumentation.org/packages/mgcv/versions/1.8-31/topics/gam](https://www.rdocumentation.org/packages/mgcv/versions/1.8-31/topics/gam). ] --- # Fitting a GAM with {mgcv} (cont.) We then inspect the fitted model via ```r print(model) ``` and access the smoothing parameter with ```r model$sp ``` A visualization of the fitted smoothers is available as ```r plot(model, pages = 1, scheme = 0) plot(model, pages = 1, scheme = 1) ``` where `pages` indicates the number of pages over which to spread the output (with `0` as the default), `scheme = .` specifies the built-in plotting scheme that should be used. More details via [https://www.rdocumentation.org/packages/mgcv/versions/1.8-31/topics/plot.gam](https://www.rdocumentation.org/packages/mgcv/versions/1.8-31/topics/plot.gam). --- # A Poisson GAM .pull-left[ We continue working with `mtpl` and now focus on `ageph`. <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-102-1.svg" width="75%" style="display: block; margin: auto;" /> ] .pull-right[ We will now explore .hi-pink[four different model specifications]: 1. `ageph` as linear effect in `glm` 2. `ageph` as factor variable in `glm` 3. `ageph` split manually into bins using `cut`, then used as factor in `glm` 4. a smooth effect of `ageph` in `mgcv::gam`. Let's go! 
Grid of observed `ageph` values ```r a <- min(mtpl$ageph):max(mtpl$ageph) ``` ] --- class: clear .pull-left[ .hi-pink[Model 1]: linear effect of `ageph` ```r freq_glm_age <- glm(nclaims ~ `ageph`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_glm_age <- predict(freq_glm_age, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_glm_age <- pred_glm_age$fit l_glm_age <- pred_glm_age$fit - qnorm(0.975)*pred_glm_age$se.fit u_glm_age <- pred_glm_age$fit + qnorm(0.975)*pred_glm_age$se.fit df <- data.frame(a, b_glm_age, l_glm_age, u_glm_age) ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-106-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Model 2]: `ageph` as factor variable in `glm` ```r freq_glm_age_f <- glm(nclaims ~ `as.factor(ageph)`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_glm_age_f <- predict(freq_glm_age_f, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_glm_age_f <- pred_glm_age_f$fit l_glm_age_f <- pred_glm_age_f$fit - qnorm(0.975)*pred_glm_age_f$se.fit u_glm_age_f <- pred_glm_age_f$fit + qnorm(0.975)*pred_glm_age_f$se.fit df <- data.frame(a, b_glm_age_f, l_glm_age_f, u_glm_age_f) ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-109-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Model 3]: `ageph` split into 5-year bins and then used in `glm` ```r *level <- seq(min(mtpl$ageph), max(mtpl$ageph), by = 5) freq_glm_age_c <- glm(nclaims ~ `cut(ageph, level)`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_glm_age_c <- predict(freq_glm_age_c, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_glm_age_c <- pred_glm_age_c$fit l_glm_age_c <- pred_glm_age_c$fit - qnorm(0.975)*pred_glm_age_c$se.fit u_glm_age_c <- pred_glm_age_c$fit + qnorm(0.975)*pred_glm_age_c$se.fit df <- data.frame(a, b_glm_age_c, l_glm_age_c, u_glm_age_c) ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-112-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Model 4]: smooth effect of `ageph` in `mgcv::gam` ```r library(mgcv) freq_gam_age <- gam(nclaims ~ `s(ageph)`, offset = log(expo), data = mtpl, family = poisson(link = "log")) pred_gam_age <- predict(freq_gam_age, newdata = data.frame(ageph = a, expo = 1), `type = "terms"`, se.fit = TRUE) b_gam_age <- pred_gam_age$fit l_gam_age <- pred_gam_age$fit - qnorm(0.975)*pred_gam_age$se.fit u_gam_age <- pred_gam_age$fit + qnorm(0.975)*pred_gam_age$se.fit df <- data.frame(a, b_gam_age, l_gam_age, u_gam_age) ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-115-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .hi-pink[Model 4] (revisited): picture smooth effect of `ageph` in `mgcv::gam` with built-in `plot`. 
```r library(mgcv) freq_gam <- gam(nclaims ~ s(ageph), offset = log(expo), family = poisson(link = "log"), data = mtpl) plot(freq_gam, scheme = 1) ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-116-1.svg" width="35%" height="10%" style="display: block; margin: auto;" /> --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will further explore GAMs in R with the `gam(.)` function from the {mgcv} package. .hi-pink[Q]: you will combine insights from building `glm` as well as `gam` objects by working through the following coding steps. 1. Fit a `gam` including some factor variables as well as a smooth effect of `ageph` and `bm`. Visualize the fitted smooth effects. 2. Specify risk profiles of drivers. Calculate their expected annual claim frequency from the constructed `gam`. 3. Explain (in words) which profiles would represent high vs low risk according to the constructed model. ] --- class: clear .pull-left[ .hi-pink[Q.1] examine the following `gam` fit ```r freq_gam_2 <- gam(nclaims ~ sex + fuel + use + s(ageph) + s(bm), offset = log(expo), family = poisson(link = "log"), data = mtpl) ``` ```r summary(freq_gam_2) ## ## Family: poisson ## Link function: log ## ## Formula: ## nclaims ~ sex + fuel + use + s(ageph) + s(bm) ## ## Parametric coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -1.917801 0.018124 -105.818 <2e-16 *** ## sexmale 0.009177 0.016043 0.572 0.5673 ## fuelgasoline -0.152756 0.015100 -10.116 <2e-16 *** ## usework -0.055156 0.033092 -1.667 0.0956 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df Chi.sq p-value ## s(ageph) 5.451 6.502 228.8 <2e-16 *** ## s(bm) 8.490 8.866 1228.1 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.0155 Deviance explained = 2.85% ## UBRE = -0.46446 Scale est. 
= 1 n = 163231 ``` ] .pull-right[ ```r plot(freq_gam_2, select = 1) ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-119-1.svg" width="45%" style="display: block; margin: auto;" /> ```r plot(freq_gam_2, select = 2) ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-120-1.svg" width="45%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ .hi-pink[Q.2] define some risk profiles ```r drivers <- data.frame(expo = c(1, 1, 1), sex = c("female", "female", "female"), fuel = c("diesel", "diesel", "diesel"), use = c("private", "private", "private"), ageph = c(18, 45, 65), bm = c(20, 5, 0)) drivers ``` <table> <thead> <tr> <th style="text-align:right;"> expo </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> fuel </th> <th style="text-align:left;"> use </th> <th style="text-align:right;"> ageph </th> <th style="text-align:right;"> bm </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> diesel </td> <td style="text-align:left;"> private </td> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> ] .pull-right[ Now, you predict the annual expected claim frequency for these profiles. ```r predict(freq_gam_2, newdata = drivers, type = "response") ``` <table> <thead> <tr> <th style="text-align:right;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.3969310 </td> </tr> <tr> <td style="text-align:right;"> 0.1727430 </td> </tr> <tr> <td style="text-align:right;"> 0.0951124 </td> </tr> </tbody> </table> ] --- # Smooth functions of several variables .pull-left[ We now focus on estimating smoothers of .KULbginline[two input variables]. We first put focus on estimating `\(f(x,y)\)` via .KULbginline[thin plate splines]. This is an example of a so-called isotropic smooth, used when `\(x\)` and `\(y\)` are at the same scale. These represent `\(f(x,y)\)` as `\(\sum_{n=1}^N \gamma_{n} \cdot \tilde{b}_{n}(x,y)\)`. To avoid overfitting a wiggliness penalty is added to the likelihood: `$$\int \int \left( \frac{\partial^2 f_j}{ (\partial x)^2 }\right)^2 + 2 \left( \frac{ \partial^2 f_j }{ \partial x \partial y } \right)^2 + \left( \frac{\partial^2 f_j}{ (\partial y)^2 } \right)^2 dx dy \\ = \gamma_j^{t} \boldsymbol{T}_j \gamma_j.$$` ] .pull-right[ In {mgcv} we fit the spatial effect using a thin plate spline of `lat` and `long`. ```r freq_gam_spatial <- gam(nclaims ~ s(long, lat, bs = "tp"), offset = log(expo), family = poisson(link = "log"), data = mtpl) freq_gam_spatial$sp ## s(long,lat) ## 0.858939 ``` The output shows that one smoothing parameter is used. The built-in plotting function `plot(freq_gam_spatial, scheme = 2)` reveals the structure of Belgium (a bit). ] --- # Smooth functions of several variables (cont.) .pull-left[ .hi-pink[Tensor product smooths] are typically used when variables are measured on different scale. 
We start by choosing marginal bases and penalties, as if constructing 1-D smooths of `\(x\)` and `\(y\)`: `$$f(x)=\sum \alpha_i a_i(x), \quad \quad f(z)=\sum \beta_j b_j(z),$$` with the usual penalties. The tensor product basis construction then gives: `$$f(x,y) = \sum \sum \beta_{ij} b_j(z) a_i(x),$$` with double penalties and .hi-pink[two smoothing parameters]. ] -- .pull-right[ We include an interaction effect (via `ti(., .)`) next to the individual, univariate effects. ```r freq_gam_inter <- gam(nclaims ~ s(ageph) + s(power) + ti(ageph, power, bs = "tp"), offset = log(expo), family = poisson(link = "log"), data = mtpl) freq_gam_inter$sp ## s(ageph) s(power) ti(ageph,power)1 ti(ageph,power)2 ## 1.73015996 0.08216707 654.63175948 0.72476478 ``` The built-in plotting function `plot(freq_gam_inter, scheme = 2, select = 3)` reveals the structure of the fitted interaction effect. ] --- # Visualizing the fitted spatial effect .pull-left[ To improve the visualisation of the fitted spatial effect, we extract the fitted `s(lat, long)` for each postal code stored in the `sf` object `belgium_shape_sf`. First, we retrieve the coordinates of the postal code centroids, using `st_centroid(.)` applied to an `sf` object. ```r post_dt <- st_centroid(belgium_shape_sf) post_dt$long <- do.call(rbind, post_dt$geometry)[,1] post_dt$lat <- do.call(rbind, post_dt$geometry)[,2] ``` ] -- .pull-right[ Next, we evaluate the spatial smoother stored in `freq_gam_spatial` in each of the postal codes, using the `predict(.)` function of the GAM. ```r pred <- predict(freq_gam_spatial, newdata = post_dt, type = "terms", terms = "s(long,lat)") ``` ```r dt_pred <- data.frame(pc = post_dt$POSTCODE, long = post_dt$long, lat = post_dt$lat, pred) names(dt_pred)[4] <- "fit_spatial" ``` Finally, we merge `dt_pred` and `belgium_shape_sf`. ```r belgium_shape_sf <- left_join(belgium_shape_sf, dt_pred, by = c("POSTCODE" = "pc")) ``` ] --- class: clear .pull-left[ We plot the fitted spatial effect on a map, using the tools discussed earlier on. Mind the `scale_fill_gradient(.)` to be used with a continuous variable for `fill = .`. ```r ggplot(belgium_shape_sf) + geom_sf(aes(fill = fit_spatial), colour = NA) + ggtitle("MTPL claim frequency data") + scale_fill_gradient(low = "#99CCFF", high = "#003366") + theme_bw() ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-132-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ Using the tools from the {tmap} package ```r tm_shape(belgium_shape_sf) + tm_borders(col = 'white', lwd = .1) + tm_fill("fit_spatial", style = "cont", palette = "RdBu", legend.reverse = TRUE, auto.palette.mapping = TRUE) + tm_layout(legend.title.size = 1.0, legend.text.size = 1.0) ``` For use later on we remove the `fit_spatial` variable from `belgium_shape_sf`: ```r library(dplyr) belgium_shape_sf <- belgium_shape_sf %>% dplyr::select(-fit_spatial) ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-135-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- # Putting it all together .pull-left[ We are now ready to fit the preferred GAM from Henckaerts et al. (SAJ, 2018). The proposed selection of variables is based on an .hi-pink[exhaustive search] over all possible combinations of variables. 
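That search is not repeated here, but a minimal sketch of the idea is to fit a set of candidate formulas and rank them, e.g. by AIC/BIC (the candidate list below is shortened for illustration and is not the actual search space):

```r
# sketch: rank a few candidate frequency specifications
candidates <- list(
  nclaims ~ fuel + s(ageph),
  nclaims ~ coverage + fuel + s(ageph) + s(bm),
  nclaims ~ coverage + fuel + s(ageph) + s(bm) + s(power) + s(long, lat)
)
fits <- lapply(candidates, function(f)
  gam(f, offset = log(expo), data = mtpl,
      family = poisson(link = "log")))
data.frame(model = seq_along(fits),
           aic   = sapply(fits, AIC),
           bic   = sapply(fits, BIC))
```

The preferred specification resulting from that search is fitted below.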
```r freq_gam <- gam(nclaims ~ coverage + fuel + s(ageph) + s(bm) + s(power) + s(long, lat) + ti(ageph, power, bs = "tp"), offset = log(expo), data = mtpl, family = poisson(link = "log")) ``` ] .pull-right[ ```r summary(freq_gam) ## ## Family: poisson ## Link function: log ## ## Formula: ## nclaims ~ coverage + fuel + s(ageph) + s(bm) + s(power) + s(long, ## lat) + ti(ageph, power, bs = "tp") ## ## Parametric coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -1.9745911 0.0216710 -91.117 < 2e-16 *** ## coveragePO 0.0008931 0.0239860 0.037 0.97 ## coverageTPL 0.1221058 0.0221623 5.510 3.6e-08 *** ## fuelgasoline -0.1817652 0.0158127 -11.495 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df Chi.sq p-value ## s(ageph) 5.709 6.754 295.74 < 2e-16 *** ## s(bm) 6.458 7.373 991.37 < 2e-16 *** ## s(power) 6.393 7.282 107.33 < 2e-16 *** ## s(long,lat) 25.754 28.243 418.77 < 2e-16 *** ## ti(ageph,power) 1.731 2.154 15.88 0.000464 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.0196 Deviance explained = 3.5% ## UBRE = -0.46764 Scale est. = 1 n = 163231 ``` Using the plotting functionalities in {mgcv} we can picture the estimated smoothers via `plot(freq_gam, page = 1, scheme = 2)`. ] --- class: inverse, center, middle name: from-gam-to-glm # From GAM to GLM using clustering and tree methods --- # Binning the spatial effect .pull-left[ We continue working with `post_dt`, a data set with the 1 146 Belgian postal codes, their corresponding `lat` and `long` (of the center of the postal code). We want to use the `predict` function of the GAM stored in `freq_gam` on the data set `post_dt`. Hereto, we need to extend `post_dt` with values for the other covariates used by `freq_gam`. ```r post_dt$coverage <- mtpl$coverage[1] post_dt$fuel <- mtpl$fuel[1] post_dt[c("bm", "ageph", "power", "expo")] <- c(mtpl$bm[1], mtpl$ageph[1], mtpl$power[1], mtpl$expo[1]) ``` ] .pull-right[ We will use `\(\hat{f}(\texttt{long}, \texttt{lat})\)` from the `freq_gam` object. We create a new data set `dt_pred`, with the postal codes, the coordinates of their centroids and the fitted spatial effect. ```r pred <- predict(freq_gam, newdata = post_dt, type = "terms", terms = "s(long,lat)") dt_pred <- tibble(pc = post_dt$POSTCODE, long = post_dt$long, lat = post_dt$lat, pred) names(dt_pred)[4] <- "fit_spatial" ``` ```r dt_pred <- dplyr::arrange(dt_pred, pc) ``` ] --- # Binning the spatial effect (cont.) .pull-left[ We use the fitted spatial effect `\(s_i := \hat{f}(\texttt{long}_i,\texttt{lat}_i)\)` for every `\(i = 1,\ldots,1\ 146\)`. We .KULbginline[split or bin] using one of the following: (also see the {classint} package): * equal intervals: `\(k\)` bins of equal length `\(\frac{\max{(s_i)}-\min{(s_i)}}{k}\)` * quantile binning: each bin contains `\(\approx \frac{1\ 146}{k}\)` municipalities * complete linkage (i.e. hierarchical clustering) * Fisher's natural breaks. The goal is to construct .KULbginline[homogeneous] class intervals (or bins). ] .pull-right[ With Fisher's natural breaks (i.e. `\(K\)`-means clustering): * minimize the sum of squared distances between observations `\(s_u^{(l)}\)` and the bin means `$$\sum_{l=1}^K \sum_{u=1}^{n_l} (s_u^{(l)}-\bar{s}^{(l)})^2,$$` with `\(K\)` the number of bins. 
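As a small illustration of this objective, the criterion can be evaluated for any candidate set of break points (the quantile-based breaks below are just an example, not the bins constructed later):

```r
# sketch: within-bin sum of squares for a given set of break points
s_hat  <- as.numeric(dt_pred$fit_spatial)
breaks <- quantile(s_hat, probs = seq(0, 1, length.out = 6))  # 5 candidate bins
bin    <- cut(s_hat, breaks = breaks, include.lowest = TRUE)
sum(tapply(s_hat, bin, function(x) sum((x - mean(x))^2)))
```

Fisher's natural breaks pick the break points that minimize this quantity.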
] --- class: clear .pull-left[ In R we use the `classIntervals` function ```r num_bins <- 5 library(classInt) classint_fisher <- classIntervals( dt_pred$fit_spatial, num_bins, style = "fisher") classint_fisher$brks min(dt_pred$fit_spatial) max(dt_pred$fit_spatial) ``` ] .pull-right[ We fix the number of bins upfront and store it as `num_bins`. This is a tuning parameter, and we discuss a tuning strategy later on. ] --- class: clear .pull-left[ ```r num_bins <- 5 library(classInt) classint_fisher <- classIntervals( `dt_pred$fit_spatial`, `num_bins`, `style = "fisher"`) classint_fisher$brks min(dt_pred$fit_spatial) max(dt_pred$fit_spatial) ``` ] .pull-right[ In R we use the `classInt::classIntervals` function. We specify the fitted values of the spatial smoother, stored in `dt_pred$fit_spatial`, the number of bins and the clustering technique. We inspect the constructed bins, stored in `classint_fisher`: ```r classint_fisher$brks ## [1] -0.47852103 -0.27012356 -0.14106325 -0.03575633 0.10961949 0.34051818 min(dt_pred$fit_spatial) ## [1] -0.478521 max(dt_pred$fit_spatial) ## [1] 0.3405182 ``` ] --- class: clear .pull-left[ We picture the binned spatial effect using the `plot` function that comes with the {classInt} package. ```r crp <- colorRampPalette(c("#99CCFF", "#003366")) plot(classint_fisher, crp(num_bins), xlab = expression(hat(f)(long,lat)), main = "Fisher") ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-145-1.svg" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ We merge the `fit_spatial` effect with the `belgium_shape_sf` object, per postal code. ```r belgium_shape_sf <- left_join(belgium_shape_sf, dt_pred, by = c("POSTCODE" = "pc")) ``` Using the `cut(.)` function we bin the `fit_spatial` effect using the break points stored in `classint_fisher`. ```r belgium_shape_sf$class_fisher <- cut(belgium_shape_sf$fit_spatial, breaks = classint_fisher$brks, right = FALSE, include.lowest = TRUE, dig.lab = 2) ``` ] --- class: clear .pull-left[ As a final step, we visualize the constructed clusters on the map of Belgium. ```r ggplot(belgium_shape_sf) + theme_bw() + labs(fill = "Fisher") + geom_sf(aes(fill = class_fisher), colour = NA) + ggtitle("MTPL claim frequency data") + scale_fill_brewer(palette = "Blues", na.value = "white") + theme_bw() ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-149-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will further explore the clustering methods in the `classIntervals(.)` function from the {classInt} package. .hi-pink[Q]: rerun the coding steps discussed above 1. use a different `style = .` argument. Search the `? classIntervals` for other options. 2. visualize the resulting clusters on the cdf of the fitted spatial smoother. 3. visualize the resulting clusters on the map of Belgium. ] --- # Tuning the number of spatial clusters We now construct a new data set where the spatial effect is replaced by a .KULbginline[binned version]. The binnend effect is stored as `mtpl_geo$geo`. 
```r library(dplyr) mtpl_geo <- mtpl %>% dplyr::select(nclaims, expo, coverage, fuel, ageph, bm, power, pc) mtpl_geo <- left_join(mtpl_geo, dt_pred) mtpl_geo$geo <- as.factor(cut(mtpl_geo$fit_spatial, breaks = classint_fisher$brks, right = FALSE, include.lowest = TRUE, dig.lab = 2)) head(mtpl_geo$geo) ## [1] [0.11,0.34] [0.11,0.34] [0.11,0.34] [0.11,0.34] [0.11,0.34] [0.11,0.34] ## 5 Levels: [-0.48,-0.27) [-0.27,-0.14) [-0.14,-0.036) ... [0.11,0.34] ``` On this new data set, we fit the GAM, with the factor variable `geo` replacing the spatial smoother `s(long, lat)`. --- class: clear To tune the number of spatial bins, we store the AIC/BIC of `freq_gam_geo`. We repeat the calculation over a .KULbginline[grid of values] for `num_bins`. The optimal number of bins is chosen by minimizing the AIC/BIC. ```r freq_gam_geo <- gam(nclaims ~ coverage + fuel + s(ageph) + s(bm) + s(power) + ti(ageph, power, bs = "tp") + geo, offset = log(expo) , data = mtpl_geo, family = poisson(link = "log")) summary(freq_gam_geo) ## ## Family: poisson ## Link function: log ## ## Formula: ## nclaims ~ coverage + fuel + s(ageph) + s(bm) + s(power) + ti(ageph, ## power, bs = "tp") + geo ## ## Parametric coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -2.271692 0.055518 -40.918 < 2e-16 *** ## coveragePO -0.004255 0.023915 -0.178 0.85879 ## coverageTPL 0.118676 0.022002 5.394 6.89e-08 *** ## fuelgasoline -0.179961 0.015758 -11.420 < 2e-16 *** ## geo[-0.27,-0.14) 0.122503 0.055900 2.191 0.02842 * ## geo[-0.14,-0.036) 0.172135 0.054432 3.162 0.00156 ** ## geo[-0.036,0.11) 0.328476 0.053061 6.191 5.99e-10 *** ## geo[0.11,0.34] 0.530189 0.054194 9.783 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df Chi.sq p-value ## s(ageph) 5.663 6.709 295.74 < 2e-16 *** ## s(bm) 6.708 7.585 1019.65 < 2e-16 *** ## s(power) 6.409 7.296 111.05 < 2e-16 *** ## ti(ageph,power) 1.796 2.265 14.96 0.000856 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.0195 Deviance explained = 3.46% ## UBRE = -0.46769 Scale est. = 1 n = 163231 ``` --- # Binning the univariate smooth effects When binning the continuous risk factors, we aim to construct consecutive intervals. The clustering methods used to bin the spatial effect do not guarantee consecutive intervals. We replace them with .KULbginline[tree-based methods]. The tree will minimize the following MSE: `$$\frac{\sum_{i=\min(\texttt{ageph})}^{\max(\texttt{ageph})}w_{\texttt{ageph}_i}(\hat{f}(\texttt{ageph}_i)-\hat{f}^{b}(\texttt{ageph}_i))^2 }{\sum_{i=\min(\texttt{ageph})}^{\max(\texttt{ageph})}w_{\texttt{ageph}_i}}.$$` Here, `\(\hat{f}^{b}(\texttt{ageph}_i)\)` is a binned approximation of `\(\hat{f}(\texttt{ageph}_i)\)`, obtained by fitting a constant per constructed bin. To balance goodness-of-fit and complexity, the `evtree` minimizes `$$n \cdot \text{MSE} + \alpha \cdot \text{complexity},$$` where `\(\alpha\)` is a tuning parameter. --- # Binning the univariate smooth effects (cont.) Starting from the GAM stored in `freq_gam_geo` we extract a data set with the unique values of a continous risk factor, the number of records observed for each value, and the fitted smoother evaluated in these values. 
```r getGAMdata_single = function(model, term, var, varname){ pred <- predict(model, type = "terms", terms = term) dt_pred <- tibble("x" = var, pred) dt_pred <- arrange(dt_pred, x) names(dt_pred) = c("x", "s") dt_unique <- unique(dt_pred) dt_exp <- dt_pred %>% group_by(x) %>% summarize(tot = n()) dt_exp <- dt_exp[c("x", "tot")] GAM_data <- left_join(dt_unique, dt_exp) names(GAM_data) <- c(varname, "s", "tot") GAM_data <- GAM_data[which(GAM_data$tot != 0), ] return(GAM_data) } freq_gam_ageph <- getGAMdata_single(freq_gam_geo, "s(ageph)", mtpl_geo$ageph, "ageph") ``` --- # Binning the univariate smooth effects (cont.) .pull-left[ We put focus on `\(\hat{f}(\texttt{ageph})\)` and use (evolutionary) trees to bin the fitted effect, by making splits on `ageph` only. ```r library(evtree) source("./scripts/evtree.R") ``` ```r ctrl.freq <- evtree.control( minbucket = 0.05*nrow(mtpl), alpha = 550, maxdepth = 5) ``` Notice the `weights` argument in the `evtree` function, using the number of policyholders with a particular `ageph` as a weight. ```r evtree_freq_ageph <- `evtree`(s ~ ageph, data = freq_gam_ageph, `weights = tot`, control = ctrl.freq) evtree_freq_ageph ``` ] .pull-right[ ``` ## ## Model formula: ## s ~ ageph ## ## Fitted party: ## [1] root ## | [2] ageph < 61 ## | | [3] ageph < 29 ## | | | [4] ageph < 26: * ## | | | [5] ageph >= 26: * ## | | [6] ageph >= 29 ## | | | [7] ageph < 56 ## | | | | [8] ageph < 33: * ## | | | | [9] ageph >= 33 ## | | | | | [10] ageph < 51: * ## | | | | | [11] ageph >= 51: * ## | | | [12] ageph >= 56: * ## | [13] ageph >= 61 ## | | [14] ageph < 73: * ## | | [15] ageph >= 73: * ## ## Number of inner nodes: 7 ## Number of terminal nodes: 8 ``` ] --- # Binning the univariate smooth effects (cont.) .pull-left[ We put focus on `\(\hat{f}(\texttt{ageph})\)` and use (evolutionary) trees to bin the fitted effect, by making splits on `ageph` only. ```r library(evtree) source("./scripts/evtree.R") ``` ```r ctrl.freq <- evtree.control( minbucket = 0.05*nrow(mtpl), alpha = 550, maxdepth = 5) ``` Notice the `weights` argument in the `evtree` function, using the number of policyholders with a particular `ageph` as a weight. ```r evtree_freq_ageph <- `evtree`(s ~ ageph, data = freq_gam_ageph, `weights = tot`, control = ctrl.freq) plot(evtree_freq_ageph) ``` ] .pull-right[ <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-160-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will now examine the splits created for the other smooth effects of continuous risk factors. .hi-pink[Q]: you will work through the following exploratory steps. 1. Split the smooth effects of `bm` and `power` using `evtree`. 2. Visualize the constructed trees. 3. What would (conceptually) be your strategy to split the interaction effect `ti(ageph, power, bs = "tp")`. <br>
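A possible starting point for step 1, reusing `getGAMdata_single` and the control settings from before (the object names `freq_gam_bm` and `evtree_freq_bm` are just illustrative):

```r
# extract the fitted smooth effect of bm from freq_gam_geo
freq_gam_bm <- getGAMdata_single(freq_gam_geo, "s(bm)",
                                 mtpl_geo$bm, "bm")

# bin the fitted effect with an evolutionary tree,
# weighted by the number of policyholders per bm value
evtree_freq_bm <- evtree(s ~ bm, data = freq_gam_bm,
                         weights = tot, control = ctrl.freq)
plot(evtree_freq_bm)
```

The same pattern applies to `power`.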
] --- class: clear .pull-left[ To extract the split points from the tree model stored in `evtree_freq_ageph` we use the following helper function ```r splits_evtree = function(evtreemodel, GAMvar, DTvar){ preds <- predict(evtreemodel, type = "node") nodes <- data.frame("x" = GAMvar, "nodes" = preds) nodes$change <- c(0, pmin(1, diff(nodes$nodes))) splits_evtree <- unique(c(min(DTvar), nodes$x[which(nodes$change==1)], max(DTvar))) return(splits_evtree) } ``` ] .pull-right[ ```r freq_splits_ageph <- splits_evtree( evtree_freq_ageph, freq_gam_ageph$ageph, mtpl$ageph) freq_splits_ageph ## [1] 18 26 29 33 51 56 61 73 95 ``` ] --- class: clear .pull-left[ To compare the fitted smooth effect with the constructed bins of `ageph` we use ```r ggplot.gam <- function(model, variable, gam_term, xlabel, ylabel){ pred <- predict(model, type = "terms", se = TRUE) col_index <- which(colnames(pred$fit)==gam_term) x <- variable b <- pred$fit[, col_index] l <- pred$fit[, col_index] - qnorm(0.975) * pred$se.fit[, col_index] u <- pred$fit[, col_index] + qnorm(0.975) * pred$se.fit[, col_index] df <- unique(data.frame(x, b, l, u)) p <- ggplot(df, aes(x = x)) p <- p + geom_line(aes(y = b), size = 1, col = "#003366") p <- p + geom_line(aes(y = l), size = 0.5, linetype = 2, col = "#99CCFF") p <- p + geom_line(aes(y = u), size = 0.5, linetype = 2, col = "#99CCFF") p <- p + xlab(xlabel) + ylab(ylabel) + theme_bw() p } ``` ] .pull-right[ ```r plot_freq_bin_ageph <- ggplot.gam(freq_gam_geo, mtpl_geo$ageph, "s(ageph)", "ageph", expression(hat(f)(ageph))) plot_freq_bin_ageph <- plot_freq_bin_ageph + geom_vline(xintercept = freq_splits_ageph[2:(length(freq_splits_ageph)-1)]) plot_freq_bin_ageph ``` <img src="pricing_analytics_with_GAMs_and_GLMs_files/figure-html/unnamed-chunk-165-1.svg" width="60%" style="display: block; margin: auto;" /> ] --- # Putting it all together ```r mtpl_bin <- mtpl_geo[c("nclaims", "expo", "coverage", "fuel", "geo")] mtpl_bin$ageph <- cut(mtpl_geo$ageph, freq_splits_ageph, right = FALSE, include.lowest = TRUE) summary(mtpl_bin$ageph) ## [18,26) [26,29) [29,33) [33,51) [51,56) [56,61) [61,73) [73,95] ## 8258 9265 13939 69055 15674 12591 25323 9126 ``` ```r summary(mtpl_bin$coverage) ## FO PO TPL ## 22107 45988 95136 summary(mtpl_bin$fuel) ## diesel gasoline ## 50398 112833 summary(mtpl_bin$geo) ## [-0.48,-0.27) [-0.27,-0.14) [-0.14,-0.036) [-0.036,0.11) [0.11,0.34] ## 4122 21809 34241 71374 31685 ``` ```r mtpl_bin$ageph <- relevel(mtpl_bin$ageph, ref = "[33,51)") mtpl_bin$geo <- relevel(mtpl_bin$geo, ref = "[-0.036,0.11)") mtpl_bin$coverage <- relevel(mtpl_bin$coverage, ref = "TPL") mtpl_bin$fuel <- relevel(mtpl_bin$fuel, ref = "gasoline") ``` --- class: clear ```r freq_glm <- gam(nclaims ~ coverage + fuel + ageph + geo, offset = log(expo), data = mtpl_bin, family = poisson(link = "log")) summary(freq_glm) ## ## Family: poisson ## Link function: log ## ## Formula: ## nclaims ~ coverage + fuel + ageph + geo ## ## Parametric coefficients: ## Estimate Std. 
Error z value Pr(>|z|) ## (Intercept) -1.95021 0.01555 -125.444 < 2e-16 *** ## coverageFO -0.13001 0.02159 -6.022 1.73e-09 *** ## coveragePO -0.15759 0.01666 -9.458 < 2e-16 *** ## fueldiesel 0.17609 0.01503 11.714 < 2e-16 *** ## ageph[18,26) 0.61163 0.02620 23.340 < 2e-16 *** ## ageph[26,29) 0.37585 0.02715 13.846 < 2e-16 *** ## ageph[29,33) 0.19884 0.02457 8.092 5.88e-16 *** ## ageph[51,56) -0.09561 0.02599 -3.678 0.000235 *** ## ageph[56,61) -0.21004 0.02995 -7.013 2.33e-12 *** ## ageph[61,73) -0.32697 0.02355 -13.887 < 2e-16 *** ## ageph[73,95] -0.29090 0.03590 -8.102 5.39e-16 *** ## geo[-0.48,-0.27) -0.36796 0.05303 -6.938 3.97e-12 *** ## geo[-0.27,-0.14) -0.23224 0.02323 -9.996 < 2e-16 *** ## geo[-0.14,-0.036) -0.17616 0.01942 -9.070 < 2e-16 *** ## geo[0.11,0.34] 0.25816 0.01810 14.259 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## ## R-sq.(adj) = 0.0125 Deviance explained = 2.21% ## UBRE = -0.46097 Scale est. = 1 n = 163231 ``` --- class: clear ```r freq_glm <- gam(nclaims ~ coverage + fuel + ageph + geo, offset = log(expo), data = mtpl_bin, family = poisson(link = "log")) anova(freq_glm) ## ## Family: poisson ## Link function: log ## ## Formula: ## nclaims ~ coverage + fuel + ageph + geo ## ## Parametric Terms: ## df Chi.sq p-value ## coverage 2 103.7 <2e-16 ## fuel 1 137.2 <2e-16 ## ageph 7 1326.1 <2e-16 ## geo 4 587.9 <2e-16 ``` --- name: yourturn class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ You will now extend the GLM obtained so far with factor effects of `bm` and `power`. .hi-pink[Q]: you will work through the following exploratory steps. 1. Extract the split points constructed for the smooth effects of `bm` and `power`. 2. Split the continuous variables `bm` and `power` using these split points. 3. Extend the GLM with the constructed factor variables. Pick an appropriate reference level. <br>
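A possible route (a sketch, assuming `evtree_freq_bm` and `freq_gam_bm` from the previous exercise; the same steps apply to `power`):

```r
# extract the split points for bm and bin the variable
freq_splits_bm <- splits_evtree(evtree_freq_bm,
                                freq_gam_bm$bm, mtpl$bm)
mtpl_bin$bm <- cut(mtpl_geo$bm, freq_splits_bm,
                   right = FALSE, include.lowest = TRUE)

# pick the most populated bin as reference level
mtpl_bin$bm <- relevel(mtpl_bin$bm,
                       ref = names(which.max(table(mtpl_bin$bm))))

# extend the frequency GLM with the extra factor
freq_glm_ext <- gam(nclaims ~ coverage + fuel + ageph + bm + geo,
                    offset = log(expo), data = mtpl_bin,
                    family = poisson(link = "log"))
summary(freq_glm_ext)
```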
] --- # That's a wrap! [A data driven binning strategy for the construction of insurance tariff classes](https://katrienantonio.github.io/projects/2019/06/13/machine-learning/) by Henckaerts, Antonio, Clijsters & Verbelen (2018). <center> <img src="img/data_driven.png" width="850"/> </center> --- # What else is there? [Sparse regression with multi-type feature modelling](https://arxiv.org/abs/1810.03136) by Devriendt, Antonio et al. (2018) * automatic feature selection and binning of risk factors via .hi-pink[regularization] * [R package {smurf}](https://cran.r-project.org/web/packages/smurf/index.html) on CRAN * end product is a GLM! -- [Boosting insights in insurance tariff plans with tree-based machine learning](https://arxiv.org/abs/1904.10890) by Henckaerts, Côté, Antonio & Verbelen (2020), North American Actuarial Journal. * GLMs, GAMs, decision trees, random forests and gradient boosting machines * R packages extended to Poisson and gamma deviance * tuning strategy, interpretability tools and managerial insights * supplementary material on [GitHub](https://github.com/henckr). --- # Thanks! <img src="img/xaringan.png" class="title-hex"> <br> <br> <br> <br> Slides created with the R package [xaringan](https://github.com/yihui/xaringan). <br> <br> <br> Course material available via <br> <svg style="height:0.8em;top:.04em;position:relative;fill:#116E8A;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> https://github.com/katrienantonio/PE-pricing-analytics